16 research outputs found
Data-Driven Forecasting of High-Dimensional Chaotic Systems with Long Short-Term Memory Networks
We introduce a data-driven forecasting method for high-dimensional chaotic
systems using long short-term memory (LSTM) recurrent neural networks. The
proposed LSTM neural networks perform inference of high-dimensional dynamical
systems in their reduced order space and are shown to be an effective set of
nonlinear approximators of their attractor. We demonstrate the forecasting
performance of the LSTM and compare it with Gaussian processes (GPs) in time
series obtained from the Lorenz 96 system, the Kuramoto-Sivashinsky equation
and a prototype climate model. The LSTM networks outperform the GPs in
short-term forecasting accuracy in all applications considered. A hybrid
architecture, extending the LSTM with a mean stochastic model (MSM-LSTM), is
proposed to ensure convergence to the invariant measure. This novel hybrid
method is fully data-driven and extends the forecasting capabilities of LSTM
networks.Comment: 31 page
LISA: Localized Image Stylization with Audio via Implicit Neural Representation
We present a novel framework, Localized Image Stylization with Audio (LISA)
which performs audio-driven localized image stylization. Sound often provides
information about the specific context of the scene and is closely related to a
certain part of the scene or object. However, existing image stylization works
have focused on stylizing the entire image using an image or text input.
Stylizing a particular part of the image based on audio input is natural but
challenging. In this work, we propose a framework that a user provides an audio
input to localize the sound source in the input image and another for locally
stylizing the target object or scene. LISA first produces a delicate
localization map with an audio-visual localization network by leveraging CLIP
embedding space. We then utilize implicit neural representation (INR) along
with the predicted localization map to stylize the target object or scene based
on sound information. The proposed INR can manipulate the localized pixel
values to be semantically consistent with the provided audio input. Through a
series of experiments, we show that the proposed framework outperforms the
other audio-guided stylization methods. Moreover, LISA constructs concise
localization maps and naturally manipulates the target object or scene in
accordance with the given audio input
Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation,
which unifies pre-trained text-image diffusion and discriminative models to
perform open-vocabulary panoptic segmentation. Text-to-image diffusion models
have shown the remarkable capability of generating high-quality images with
diverse open-vocabulary language descriptions. This demonstrates that their
internal representation space is highly correlated with open concepts in the
real world. Text-image discriminative models like CLIP, on the other hand, are
good at classifying images into open-vocabulary labels. We propose to leverage
the frozen representation of both these models to perform panoptic segmentation
of any category in the wild. Our approach outperforms the previous state of the
art by significant margins on both open-vocabulary panoptic and semantic
segmentation tasks. In particular, with COCO training only, our method achieves
23.4 PQ and 30.0 mIoU on the ADE20K dataset, with 8.3 PQ and 7.9 mIoU absolute
improvement over the previous state-of-the-art. Project page is available at
https://jerryxu.net/ODISE .Comment: CVPR 2023. Project page: https://jerryxu.net/ODIS
Convolutional State Space Models for Long-Range Spatiotemporal Modeling
Effectively modeling long spatiotemporal sequences is challenging due to the
need to model complex spatial correlations and long-range temporal dependencies
simultaneously. ConvLSTMs attempt to address this by updating tensor-valued
states with recurrent neural networks, but their sequential computation makes
them slow to train. In contrast, Transformers can process an entire
spatiotemporal sequence, compressed into tokens, in parallel. However, the cost
of attention scales quadratically in length, limiting their scalability to
longer sequences. Here, we address the challenges of prior methods and
introduce convolutional state space models (ConvSSM) that combine the tensor
modeling ideas of ConvLSTM with the long sequence modeling approaches of state
space methods such as S4 and S5. First, we demonstrate how parallel scans can
be applied to convolutional recurrences to achieve subquadratic parallelization
and fast autoregressive generation. We then establish an equivalence between
the dynamics of ConvSSMs and SSMs, which motivates parameterization and
initialization strategies for modeling long-range dependencies. The result is
ConvS5, an efficient ConvSSM variant for long-range spatiotemporal modeling.
ConvS5 significantly outperforms Transformers and ConvLSTM on a long horizon
Moving-MNIST experiment while training 3X faster than ConvLSTM and generating
samples 400X faster than Transformers. In addition, ConvS5 matches or exceeds
the performance of state-of-the-art methods on challenging DMLab, Minecraft and
Habitat prediction benchmarks and enables new directions for modeling long
spatiotemporal sequences
The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion
In recent years, video generation has become a prominent generative tool and
has drawn significant attention. However, there is little consideration in
audio-to-video generation, though audio contains unique qualities like temporal
semantics and magnitude. Hence, we propose The Power of Sound (TPoS) model to
incorporate audio input that includes both changeable temporal semantics and
magnitude. To generate video frames, TPoS utilizes a latent stable diffusion
model with textual semantic information, which is then guided by the sequential
audio embedding from our pretrained Audio Encoder. As a result, this method
produces audio reactive video contents. We demonstrate the effectiveness of
TPoS across various tasks and compare its results with current state-of-the-art
techniques in the field of audio-to-video generation. More examples are
available at https://ku-vai.github.io/TPoS/Comment: ICCV202
Sound-Guided Semantic Video Generation
The recent success in StyleGAN demonstrates that pre-trained StyleGAN latent
space is useful for realistic video generation. However, the generated motion
in the video is usually not semantically meaningful due to the difficulty of
determining the direction and magnitude in the StyleGAN latent space. In this
paper, we propose a framework to generate realistic videos by leveraging
multimodal (sound-image-text) embedding space. As sound provides the temporal
contexts of the scene, our framework learns to generate a video that is
semantically consistent with sound. First, our sound inversion module maps the
audio directly into the StyleGAN latent space. We then incorporate the
CLIP-based multimodal embedding space to further provide the audio-visual
relationships. Finally, the proposed frame generator learns to find the
trajectory in the latent space which is coherent with the corresponding sound
and generates a video in a hierarchical manner. We provide the new
high-resolution landscape video dataset (audio-visual pair) for the
sound-guided video generation task. The experiments show that our model
outperforms the state-of-the-art methods in terms of video quality. We further
show several applications including image and video editing to verify the
effectiveness of our method
Image Analysis with Long Short-Term Memory Recurrent Neural Networks
Computer Vision (CV) problems, such as image classification and segmentation, have traditionally been solved by manual construction of feature hierarchies or incorporation of other prior knowledge. However, noisy images, varying viewpoints and lighting conditions of images, and clutters in real-world images make the problem challenging. Such tasks cannot be efficiently solved without learning from data. Therefore, many Deep Learning (DL) approaches have recently been successful for various CV tasks, for instance, image classification, object recognition and detection, action recognition, video classification, and scene labeling. The main focus of this thesis is to investigate a purely learning-based approach, particularly, Multi-Dimensional LSTM (MD-LSTM) recurrent neural networks to tackle the challenging CV tasks, classification and segmentation on 2D and 3D image data. Due to the structural nature of MD-LSTM, the network learns directly from raw pixel values and takes the complex spatial dependencies of each pixel into account. This thesis provides several key contributions in the field of CV and DL.
Several MD-LSTM network architectural options are suggested based on the type of input and output, as well as the requiring tasks. Including the main layers, which are an input layer, a hidden layer, and an output layer, several additional layers can be added such as a collapse layer and a fully connected layer. First, a single Two Dimensional LSTM (2D-LSTM) is directly applied on texture images for segmentation and show improvement over other texture segmentation methods. Besides, a 2D-LSTM layer with a collapse layer is applied for image classification on texture and scene images and have provided an accurate classification results. In addition, a deeper model with a fully connected layer is introduced to deal with more complex images for scene labeling and outperforms the other state-of-the-art methods including the deep Convolutional Neural Networks (CNN). Here, several input and output representation techniques are introduced to achieve the robust classification. Randomly sampled windows as input are transformed in scaling and rotation, which are integrated to get the final classification. To achieve multi-class image classification on scene images, several pruning techniques are introduced. This framework provides a good results in automatic web-image tagging. The next contribution is an investigation of 3D data with MD-LSTM. The traditional cuboid order of computations in Multi-Dimensional LSTM (MD-LSTM) is re-arranged in pyramidal fashion. The resulting Pyramidal Multi-Dimensional LSTM (PyraMiD-LSTM) is easy to parallelize, especially for 3D data such as stacks of brain slice images. PyraMiD-LSTM was tested on 3D biomedical volumetric images and achieved best known pixel-wise brain image segmentation results and competitive results on Electron Microscopy (EM) data for membrane segmentation.
To validate the framework, several challenging databases for classification and segmentation are proposed to overcome the limitations of current databases. First, scene images are randomly collected from the web and used for scene understanding, i.e., the web-scene image dataset for multi-class image classification. To achieve multi-class image classification, the training and testing images are generated in a different setting. For training, images belong to a single pre-defined category which are trained as a regular single-class image classification. However, for testing, images containing multi-classes are randomly collected by web-image search engine by querying the categories. All scene images include noise, background clutter, unrelated contents, and also diverse in quality and resolution. This setting can make the database possible to evaluate for real-world applications. Secondly, an automated blob-mosaics texture dataset generator is introduced for segmentation. Random 2D Gaussian blobs are generated and filled with random material textures. These textures contain diverse changes in illumination, scale, rotation, and viewpoint. The generated images are very challenging since they are even visually hard to separate the related regions.
Overall, the contributions in this thesis are major advancements in the direction of solving image analysis problems with Long Short-Term Memory (LSTM) without the need of any extra processing or manually designed steps. We aim at improving the presented framework to achieve the ultimate goal of accurate fine-grained image analysis and human-like understanding of images by machines